Abstract: Feature subset selection is the process of selecting a
minimal subset of relevant features, and it is a preprocessing
technique for a wide variety of applications. Clustering
high-dimensional data is a challenging task in data mining. A
reduced set of features makes the discovered patterns easier to
understand, and it is more significant when it is application
specific. Almost all existing feature subset selection
algorithms are neither automatic nor application specific. This
paper attempts to find the feature subset that yields optimal
clusters during clustering. The proposed Automatic Feature
Subset Selection using Genetic Algorithm (AFSGA) identifies the
required features automatically and reduces the computational
cost of determining good clusters. The performance of AFSGA is
tested on public and synthetic datasets of varying
dimensionality. Experimental results show the improved efficacy
of the algorithm in terms of cluster quality and computational
cost.

Key words: feature subset selection, genetic algorithm,
clustering.

I. Introduction 

Clustering is an unsupervised process of grouping objects into
classes of similar objects. A cluster is a collection of objects
that are highly similar to one another and dissimilar to the
objects belonging to other clusters [1-2]. Clustering is useful
in many applications such as pattern analysis, grouping,
decision making, and machine learning, including data mining,
document retrieval and image segmentation [3-4]. Hierarchical
and partitional methods are the two well-known approaches to
clustering. Hierarchical methods construct the clusters by
recursively partitioning the objects, while partitional methods
divide a dataset with or without overlap [5-6].

One of the challenges for current clustering algorithms is
dealing with high-dimensional data. The goal of feature subset
selection is to find a minimum set of features such that the
resulting probability distribution of the data classes is as
close as possible to the original distribution obtained using
all features [7]. Mining on a reduced set of features has an
additional benefit: it reduces the number of features appearing
in the discovered patterns, helping to make the patterns easier
to understand [7].

Most feature selection algorithms are based on heuristic search
approaches such as sequential search [8], nonlinear
optimization [9], and genetic algorithms. Basic heuristic
methods of attribute subset selection include stepwise

©2013ACEEE 

DOI: 01. IJRTET.9. 1.560 



forward selection, backward elimination, a combination of the
two, and decision tree induction. Stepwise forward selection
starts with an empty set of attributes as the reduced set. The
best of the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the best of
the remaining original attributes is added to the set. Stepwise
backward elimination starts with the full set of attributes; at
each step, it removes the worst attribute remaining in the set.
The combination of forward selection and backward elimination
selects the best attribute and removes the worst from among the
remaining attributes at each step. Decision tree induction
algorithms such as ID3, C4.5, and CART were originally intended
for classification. Decision tree induction constructs a
flowchart-like structure where each internal (non-leaf) node
denotes a test on an attribute, each branch corresponds to an
outcome of the test, and each external (leaf) node denotes a
class prediction. At each node, the algorithm chooses the "best"
attribute to partition the data into individual classes [7].
Ferri et al. showed that the Sequential Floating Forward Search
(SFFS) algorithm was the best among the sequential search
algorithms [10]. These methods address feature selection in a
supervised learning context, and solutions are evaluated using
predictive accuracy [11]. Among these different categories of
feature selection algorithms, the genetic algorithm is a recent
development [12].

The genetic algorithm (GA) approach to feature subset selection
first appeared in 1998 [13]. The GA is a biologically inspired
evolutionary algorithm with a great deal of potential in
scientific and engineering optimization or search problems [14].
GA is applicable to feature selection because selecting a subset
of features is a search problem. Siedlecki and Sklansky compared
the performance of GA with that of classical algorithms [15].
Many studies have been published showing the advantages of GA
for feature selection [16, 17]. Unsupervised learning via
evolutionary search for feature selection was proposed in 2000
[18]; the authors used an evolutionary local selection algorithm
to maintain a diverse population of solutions in a
multidimensional objective space. Il-Seok Oh et al. concluded
that no serious attempts had been made to improve the capability
of GA, and they developed hybrid genetic algorithms for feature
selection by embedding problem-specific local search operations
in a GA. In their work, a ripple factor is used to control the
strength of local improvement, and they showed the superiority
of GA over other algorithms in feature subset selection [12].
Feng Tan et al. proposed a mechanism that applies existing
feature selection methods to a dataset and uses the resulting
feature subsets as the population of a GA [19]. A new genetic
algorithm based wrapper feature selection method for
classification of hyperspectral image data using SVM has also
been proposed [20]. For all of these algorithms, however, the
number of features must be specified a priori; that is, the
algorithms are not automatic.

A more relevant subset of features can be selected if the
selection process is aimed at the application. This paper
proposes Automatic Feature Subset Selection using Genetic
Algorithm (AFSGA) for clustering, which determines the subset of
features automatically while clustering. A new chromosome
representation is modelled for the problem of feature subset
selection. The proposed algorithm contains two phases: the first
phase determines optimal initial seeds, and the second performs
feature subset selection while clustering, using the CS measure
as the fitness function. The efficiency of the algorithm is
studied on various public and synthetic datasets. The following
sections cover the scientific background, the genetic algorithm,
Automatic Feature Subset Selection using Genetic Algorithm
(AFSGA), experimental results, and the conclusion.

II. Scientific Background 

A. Problem Definition 

A data object can be distinguished from others by a collective
set of attributes called features, which together represent a
pattern [12]. Let $P = \{P_1, P_2, \ldots, P_n\}$ be a set of $n$
data points, each having $d$ features. These patterns can also be
represented by a data matrix $X_{n \times d}$ with $n$
$d$-dimensional row vectors. The $i$-th row vector $X_i$
characterizes the $i$-th object from the set $P$, and each
element $X_{ij}$ in $X_i$ corresponds to the $j$-th feature
($j = 1, 2, \ldots, d$) of the $i$-th data object
($i = 1, 2, \ldots, n$).

$$X_{n \times d} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1d} \\ X_{21} & X_{22} & \cdots & X_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{nd} \end{bmatrix}$$

Given such a matrix $X$, a partitional clustering algorithm
tries to find a set of partitions $C = \{C_1, C_2, \ldots, C_K\}$
of $K$ classes such that the similarity of the data objects
within the same cluster is maximal and data objects from
different clusters differ as much as possible. The partitions
should maintain three properties [21].

1) Each cluster should have at least one data object assigned,
i.e., $C_i \neq \emptyset \;\; \forall i \in \{1, 2, \ldots, K\}$.

2) Two different clusters should have no data object in common,
i.e., $C_i \cap C_j = \emptyset \;\; \forall i \neq j$ and
$i, j \in \{1, 2, \ldots, K\}$.

3) Each data object should be attached to exactly one cluster,
i.e., $\bigcup_{i=1}^{K} C_i = P$.
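
The three properties can be checked mechanically. The following is a minimal Python sketch and not part of the original method; the function name `is_valid_partition` and the representation of data objects as integer indices are assumptions made for illustration.

```python
def is_valid_partition(points, clusters):
    """Return True if `clusters` is a hard partition of `points`.

    `points` is an iterable of hashable object identifiers (here: indices),
    and `clusters` is a list of collections of those identifiers.
    """
    # Property 1: every cluster holds at least one data object.
    if any(len(cluster) == 0 for cluster in clusters):
        return False

    # Property 2: no data object appears in two different clusters.
    seen = set()
    for cluster in clusters:
        for obj in cluster:
            if obj in seen:
                return False
            seen.add(obj)

    # Property 3: the union of the clusters is the whole set P.
    return seen == set(points)


if __name__ == "__main__":
    print(is_valid_partition(range(5), [{0, 1}, {2}, {3, 4}]))
```

The check runs in time linear in the number of data objects because each object is visited once.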

B. Similarity Measure 

The dissimilarity between objects can be computed from the
distance between each pair of objects. The most popular distance
measure is the Euclidean distance. The Euclidean distance
between objects $X_i$ and $X_j$ is given by:

$$d(X_i, X_j) = \sqrt{\sum_{k=1}^{d} (X_{ik} - X_{jk})^2}$$

where $X_{ik}$ and $X_{jk}$ are the $k$-th coordinates of $X_i$
and $X_j$ respectively.
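
As a small illustration of the formula, the distance can be computed in Python as follows. The function name `euclidean` is an assumption chosen for this sketch; only the standard library is used.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of features")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
```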

III. Genetic Algorithm

The genetic algorithm is a nature-inspired stochastic
evolutionary optimization algorithm that can produce competitive
solutions for a wide variety of problems [22]. GA maintains a
set of solutions called the population. Each solution vector is
a chromosome. Biological evolution is a process of selecting
surviving individuals for the next generation. The surviving
individuals are the fittest chromosomes generated by the
crossover and mutation genetic operations. Once the candidate
solutions (parents) have been found, crossover takes place, in
which the parents' genetic information is combined to generate
new offspring (child individuals) [22]. The mutation operation
is then applied to the offspring population: with a very small
probability, some of the new individuals undergo mutation (a
small random change to their genetic material). A new fitness
value is calculated for the individuals that have undergone
mutation. The generations continue with the production of new
offspring until a stopping criterion is met [23]. The criterion
may be a target fitness value of the fittest chromosome in the
population, a maximum number of generations, or a limit on
elapsed processing time.

IV. Automatic Feature Subset Selection Using Genetic 
Algorithm For Clustering 

Automatic Feature Subset Selection using Genetic Algorithm
(AFSGA) is a genetic algorithm based feature selection method
for clustering data. The method contains two steps. The first
step finds the optimal initial centroids using the CS measure:
it selects n/10 sets of centroids randomly and keeps the best
set according to the CS measure, where n is the number of
elements in the dataset. The second step uses a GA to find the
minimal set of features required for clustering while finding
optimal clusters.
A. Selection of Optimal Initial Centroids

AFSGA determines optimal initial centroids by running the
k-means algorithm n/10 times and keeping the centroids with the
lowest CS measure.

Algorithm InitCentroids(DATA: dataset, K: number of clusters)
{
    min = +infinity;             // CS is minimized, so start high
    for i = 1 to n/10
    {
        C = k-means(DATA, K);
        csm = CSMeasure(C, DATA);
        if (csm < min)           // keep the best solution so far
        {
            min = csm;
            initcentroid = C;
        }
    }
    return initcentroid
}
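
The selection loop above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the clustering routine and the CS measure are passed in as the hypothetical callables `cluster_fn` and `score_fn`, so that the example stays self-contained, and the name `init_centroids` and the `runs` parameter are likewise assumptions.

```python
import math


def init_centroids(data, k, cluster_fn, score_fn, runs=None):
    """Run `cluster_fn` several times and keep the lowest-scoring centroids.

    cluster_fn(data, k) -> centroids   (assumed: e.g. one k-means run)
    score_fn(centroids, data) -> float (assumed: e.g. the CS measure)
    """
    if runs is None:
        runs = max(1, len(data) // 10)  # n/10 runs, as in the text

    best_score = math.inf  # CS is minimized, so start from +infinity
    best_centroids = None
    for _ in range(runs):
        centroids = cluster_fn(data, k)
        score = score_fn(centroids, data)
        if score < best_score:
            best_score = score
            best_centroids = centroids
    return best_centroids, best_score
```

Starting from positive infinity guarantees that the first run is always accepted, which is the behaviour the loop needs when lower CS values are better.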

The procedural steps and representation of the chromo- 
some to determine the necessary features from the given 
dataset are as follows. 

B. Chromosome Representation 

A single chromosome is represented by a bit vector. In AFSGA,
the chromosome is a vector of size D (the number of features).
Each bit represents the activation status of a feature: one
indicates that the feature is active, while zero indicates that
it is inactive in the clustering process. Each bit is a gene,
and a set of genes makes a chromosome, which represents the set
of features necessary for clustering. The conceptual model of
the chromosome is shown in Fig. 1.



[Figure: a chromosome drawn as a bit vector of length D]

Fig. 1. Chromosome Representation

In the figure, the chromosome has D = 6 features, of which
features 1, 3, 4 and 6 are active for clustering, i.e., the bit
vector is 1 0 1 1 0 1.
A GA based algorithm maintains a set of solutions (chromosomes)
called the population. Here the population size is set to n,
the number of data objects.
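
To make the encoding concrete, the following Python sketch decodes a chromosome into the indices of its active features and restricts a data row to those features. The helper names `active_features` and `project` are assumptions for illustration, the sample row values are arbitrary, and indices are zero-based, so features 1, 3, 4 and 6 appear as 0, 2, 3 and 5.

```python
def active_features(chromosome):
    """Zero-based indices of the features whose bit is set to one."""
    return [j for j, bit in enumerate(chromosome) if bit == 1]


def project(row, chromosome):
    """Keep only the values of `row` that belong to active features."""
    if len(row) != len(chromosome):
        raise ValueError("row and chromosome must have the same length")
    return [value for value, bit in zip(row, chromosome) if bit == 1]


if __name__ == "__main__":
    chromosome = [1, 0, 1, 1, 0, 1]  # D = 6, as in the example of Fig. 1
    print(active_features(chromosome))  # [0, 2, 3, 5]
    print(project([5.1, 3.5, 1.4, 0.2, 9.9, 7.0], chromosome))
```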

B. Population Initialization 

Each bit is initialised to either one or zero based on a
generated random number. If the random number is less than 0.5,
the bit is set to one; otherwise it is set to zero. The
algorithm is as follows:
Algorithm InitPop(D: number of features, n: population size)
{
    for (i = 1 to n)
        for (each feature f in the i-th chromosome)
        {
            if (rand(1) < 0.5)
                f = 1;
            else
                f = 0;
        }
}



The sample population generated by the above algorithm with
D = 6 features and n = 5 data objects is shown in Fig. 2.




Fig. 2. Sample Population 
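
The InitPop procedure can be sketched in Python as follows. The function name `init_population` and the optional `rng` argument (used only to make runs reproducible) are assumptions for illustration. Note that this sketch, like the pseudocode, can produce a chromosome with no active feature; how such a chromosome should be handled is not stated in the text.

```python
import random


def init_population(n, d, rng=None):
    """Create n random chromosomes, each a list of d bits.

    A bit is set to 1 when the drawn random number is below 0.5,
    and to 0 otherwise, following the rule described above.
    """
    if rng is None:
        rng = random.Random()
    return [[1 if rng.random() < 0.5 else 0 for _ in range(d)]
            for _ in range(n)]


if __name__ == "__main__":
    # Five chromosomes with six features each, as in the Fig. 2 example.
    for chromosome in init_population(5, 6, random.Random(0)):
        print(chromosome)
```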

C. Selection of Parents 

Parents or candidate solutions are selected randomly from 
the current population. 

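Algorithm SelectParent(i: current chromosome, popsize: population size)
{
    v = randperm(popsize);      // random permutation of 1..popsize
    j = 1;
    while j <= 2
    {
        if v(1) ~= i            // skip the current chromosome
        {
            parent(j) = v(1);
            j = j + 1;
        }
        v = v(2:length(v));     // drop the candidate just examined
    }
    return parent
}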
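
A compact Python sketch of the same idea is given below: two distinct parents are drawn at random, excluding the current chromosome. The function name `select_parents`, the zero-based indices, and the requirement of at least three chromosomes are assumptions of this sketch.

```python
import random


def select_parents(i, pop_size, rng=None):
    """Pick two distinct parent indices, neither equal to index i."""
    if pop_size < 3:
        raise ValueError("need at least three chromosomes in the population")
    if rng is None:
        rng = random.Random()
    # All indices except the current chromosome are eligible.
    candidates = [j for j in range(pop_size) if j != i]
    # sample() draws without replacement, so the two parents differ.
    return rng.sample(candidates, 2)


if __name__ == "__main__":
    print(select_parents(2, 5, random.Random(0)))
```

Sampling without replacement from the eligible indices draws from the same set of outcomes as walking through a random permutation and skipping index i.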



D. Crossover and mutation operators 

The principle of applying genetic operators is to change the
chromosomes in successive generations until the stopping
criterion is met. Crossover and mutation are the two genetic
operations in genetic algorithms. The proposed algorithm uses
single-point crossover. In single-point crossover, two parents
are selected randomly from the current population and a value v
between 1 and D is chosen. A new offspring is formed by
combining feature bits 1 to v from parent 1 with feature bits
v+1 to D from parent 2. An example of single-point crossover is
shown in Fig. 3.
Fig. 3. Single Point Crossover
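
Single-point crossover as described above can be sketched in Python as follows. The function name `single_point_crossover` and the optional `v` and `rng` arguments are assumptions for illustration; passing `v` explicitly makes the result deterministic.

```python
import random


def single_point_crossover(parent1, parent2, v=None, rng=None):
    """Offspring = bits 1..v of parent1 followed by bits v+1..D of parent2."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    d = len(parent1)
    if v is None:
        if rng is None:
            rng = random.Random()
        v = rng.randint(1, d)  # cut point between 1 and D, inclusive
    # Slicing is zero-based: parent1[:v] holds the first v bits.
    return list(parent1[:v]) + list(parent2[v:])


if __name__ == "__main__":
    print(single_point_crossover([1, 1, 1, 1, 1, 1],
                                 [0, 0, 0, 0, 0, 0], v=2))  # [1, 1, 0, 0, 0, 0]
```

When v equals D the offspring is a copy of parent 1, which follows directly from the range of v given in the text.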

The mutation operation is applied to each new offspring using
the following algorithm. The mutation rate $P_m$ is an input
parameter with the value 0.1. A sample mutation is shown in
Fig. 4.
Algorithm Mutation(O: offspring, P_m: mutation rate)
{
    Let n_0 and n_1 be the number of zero bits and one bits
    respectively in the offspring.
    P_1 = P_m; P_0 = P_m x n_1 / n_0;
    for (each feature f in the chromosome)
    {
        Generate a random number r within the range [0,1].
        if (f = 1 and r < P_1)
            convert f to 0;
        else if (f = 0 and r < P_0)
            convert f to 1;
    }
}




[Figure: a chromosome of length D before and after mutation]

Fig. 4. Mutation Example
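
The mutation operator can be sketched in Python as follows, assuming the rates are P_1 = P_m for one bits and P_0 = P_m x n_1 / n_0 for zero bits. The function name `mutate`, the `rng` argument, and the guard that sets P_0 to zero when the offspring has no zero bits are assumptions of this sketch and are not stated in the text.

```python
import random


def mutate(offspring, p_m=0.1, rng=None):
    """Flip bits with rate P_1 for ones and P_0 for zeros."""
    if rng is None:
        rng = random.Random()
    n1 = sum(offspring)           # number of one bits
    n0 = len(offspring) - n1      # number of zero bits
    p1 = p_m
    # Assumption: avoid division by zero when there are no zero bits.
    p0 = p_m * n1 / n0 if n0 > 0 else 0.0

    mutated = []
    for bit in offspring:
        r = rng.random()          # r lies in [0, 1)
        if bit == 1 and r < p1:
            mutated.append(0)
        elif bit == 0 and r < p0:
            mutated.append(1)
        else:
            mutated.append(bit)
    return mutated


if __name__ == "__main__":
    print(mutate([1, 0, 1, 1, 0, 1], p_m=0.1, rng=random.Random(0)))
```

With these rates the expected number of bits switched off (n_1 x P_m) equals the expected number switched on (n_0 x P_0), so the number of active features stays roughly stable under mutation alone.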

E. Fitness Function 

The quality of a clustering solution is assessed using cluster
validity measures. Most cluster validity measures are ratios of
intra-cluster distance to inter-cluster distance. The proposed
algorithm uses the CS measure as the fitness function. The CS
measure is a cluster validity measure based on the compactness
and separation of the clusters in a clustering solution [24].

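Chou et al. (2004) proposed the CS measure for evaluating the
validity of a clustering scheme [24]. The centroid of a cluster
is computed by averaging the elements that belong to the same
cluster using

$$m_i = \frac{1}{N_i} \sum_{X_p \in C_i} X_p$$

where $N_i$ is the number of elements in cluster $C_i$. The CS
measure is then

$$CS(K) = \frac{\sum_{i=1}^{K} \left[ \frac{1}{N_i} \sum_{X_p \in C_i} \max_{X_q \in C_i} \{ d(X_p, X_q) \} \right]}{\sum_{i=1}^{K} \left[ \min_{j \in K,\, j \neq i} \{ d(m_i, m_j) \} \right]}$$
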

The CS measure is the ratio of the sum of within-cluster scatter
to the sum of between-cluster separation. The cluster
configuration that minimizes CS is taken as the optimal
solution.
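
The following Python sketch computes the CS measure for a given clustering, assuming the common formulation in which the numerator sums, per cluster, the mean distance from each point to its farthest cluster mate, and the denominator sums the distance from each centroid to its nearest other centroid. The function names, the list-of-lists data layout, and the choice to return infinity when all centroids coincide are assumptions of this sketch.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(points):
    """Component-wise mean of the points in one cluster."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def cs_measure(clusters):
    """CS validity index; lower values mean compact, well separated clusters."""
    if len(clusters) < 2 or any(len(c) == 0 for c in clusters):
        raise ValueError("need at least two non-empty clusters")
    centroids = [centroid(c) for c in clusters]

    # Numerator: per-cluster mean of each point's largest in-cluster distance.
    scatter = 0.0
    for cluster in clusters:
        scatter += sum(max(euclidean(p, q) for q in cluster)
                       for p in cluster) / len(cluster)

    # Denominator: distance from each centroid to its nearest other centroid.
    separation = 0.0
    for i, m_i in enumerate(centroids):
        separation += min(euclidean(m_i, m_j)
                          for j, m_j in enumerate(centroids) if j != i)

    # Assumption: coincident centroids give the worst possible score.
    return scatter / separation if separation > 0 else math.inf


if __name__ == "__main__":
    loose = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
    print(cs_measure(loose))
```

In the usage example each cluster has scatter 2 and the centroids are 10 apart, so the ratio is (2 + 2) / (10 + 10).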

F. Stopping Criterion 

The proposed algorithm uses stagnation (convergence) as the
stopping criterion: the algorithm stops when the difference
between the fitness values of the fittest individuals in two
successive generations is less than 0.0001.
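
The convergence test can be expressed as a small Python helper. The function name `has_converged` and the idea of keeping the best fitness of every generation in a list are assumptions for illustration; the tolerance 0.0001 is the value given in the text.

```python
def has_converged(best_fitness_history, tol=1e-4):
    """True when the best fitness changed by less than `tol`
    between the two most recent generations."""
    if len(best_fitness_history) < 2:
        return False  # at least two generations are needed to compare
    latest = best_fitness_history[-1]
    previous = best_fitness_history[-2]
    return abs(latest - previous) < tol


if __name__ == "__main__":
    print(has_converged([0.50, 0.30]))        # False
    print(has_converged([0.30, 0.29995]))     # True
```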

The AFSGA for feature selection: AFSGA selects a minimal set of
relevant features for clustering.

Input: Data set with n objects, each containing D features; k,
the number of clusters; mutation rate $P_m$.

Output: Best chromosome with a minimal number of features.

Procedure: 

Step 1: Determine the initial centroids.

1.1 Select n/10 sets of centroids C_1, C_2, ..., C_{n/10}
randomly from the dataset.

1.2 For each data object, find the nearest centroid and put the
data object in the cluster identified with that centroid.

1.3 Evaluate the quality of each clustering solution obtained
in the previous step using the CS measure.

1.4 Select the set of centroids C_i with the minimum CS measure
as the initial centroids.

Step 2: Generate the initial population of size n using
algorithm InitPop.

Step 3: Evaluate the fitness of each chromosome using the CS
measure as the fitness function.

Step 4: Generate a new offspring for each chromosome by applying
crossover to the selected candidate individuals.

Step 5: Apply the mutation operation to each new offspring
obtained in the previous step.

Step 6: Evaluate the fitness of each new offspring.

Step 7: Repeat steps 4 to 6 until the difference between the
fitness values of the fittest individuals in two successive
generations is less than 0.0001.

V. Experimental Results 

The experiments are conducted on three public datasets and two
synthetic datasets of varying size and dimensionality.
The real data sets used [25]: 

1. Iris plants database (n = 150, D = 4, k = 3): This is a well- 
known database with 4 inputs, 3 classes, and 150 data vectors. 
The data set consists of three different species of iris flower: 
Iris setosa, Iris virginica, and Iris versicolour. 

2. Glass (n = 214, D = 9, k = 6): The data were sampled from six 
different types of glass. 

3. Wine (n = 178, D = 13, k = 3): The wine data are the results 
of a chemical analysis of wines grown in the same region in 
Italy but derived from three different cultivars. The analysis 
determined the quantities of 13 constituents found in each of 
the three clusters of wines. 

4. Synthetic dataset 1 (n = 450, D = 15, k = 3).

5. Synthetic dataset 2 (n = 850, D = 20, k = 3).

The algorithm is executed 40 times on each dataset. The fittest
chromosome is identified in each independent run. The features
observed in most of the fittest chromosomes are taken as the
minimal features. The average fitness value of the fittest
chromosome over 40 independent runs and the number of features
are tabulated in Table I, alongside the k-means clustering
results. The results indicate that the proposed algorithm
attains a lower (better) CS measure than k-means with fewer
features on every dataset except Glass, where the reported CS
value is higher.
Table I: Performance evaluation of AFSGA

Dataset      Algorithm  CS measure  Number of features  % of data reduced
Iris         K-means    0.1281      4                   -
             AFSGA      0.0608      3                   25
Wine         K-means    0.2550      13                  -
             AFSGA      0.1773      11                  15.38
Glass        K-means    0.3733      9                   -
             AFSGA      0.4355      8                   11.11
Synthetic 1  K-means    0.06189     15                  -
             AFSGA      0.0361      13                  13.33
Synthetic 2  K-means    0.0763      20                  -
             AFSGA      0.0676      13                  35
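
The last column of Table I is consistent with the share of features removed, 100 x (D - selected) / D. The short Python check below reproduces the tabulated percentages from the feature counts; the function name `percent_reduced` is an assumption for illustration.

```python
def percent_reduced(total_features, selected_features):
    """Percentage of features removed, rounded to two decimals."""
    removed = total_features - selected_features
    return round(100.0 * removed / total_features, 2)


if __name__ == "__main__":
    # (dataset, D, number of features selected by AFSGA) from Table I
    rows = [("Iris", 4, 3), ("Wine", 13, 11), ("Glass", 9, 8),
            ("Synthetic 1", 15, 13), ("Synthetic 2", 20, 13)]
    for name, d, selected in rows:
        print(name, percent_reduced(d, selected))
```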



VI. Conclusion and Future Work 

Feature subset selection is the problem of selecting a subset of
features based on some optimization criterion. AFSGA selects a
subset of features for clustering using a genetic algorithm with
the CS measure as the fitness function. The results of AFSGA are
compared with those of the classical k-means clustering
algorithm and demonstrate the improved efficiency of AFSGA.
Differential evolution (DE) is one of the most powerful
stochastic real-parameter optimization algorithms in current
use, and it requires very few input parameters compared to GA
[26]. Developing a feature subset selection algorithm using
differential evolution is our upcoming work.